Abstract

Statistical learning theory chiefly studies restricted hypothesis
classes, particularly those with finite Vapnik-Chervonenkis (VC)
dimension. The fundamental quantity of interest is the sample
complexity: the number of samples required to learn to a specified
level of accuracy. Here we consider learning over the set of all
computable labeling functions. Since the VC-dimension is infinite and
a priori (uniform) bounds on the number of samples are impossible, we
let the learning algorithm decide when it has seen sufficient samples
to have learned. We first show that learning in this setting is indeed
possible, and develop a learning algorithm. We then show, however,
that bounding sample complexity independently of the distribution is
impossible. Notably, this impossibility is entirely due to the
requirement that the learning algorithm be computable, and not due to
the statistical nature of the problem.



1 Introduction 

Suppose we are trying to learn a difficult classification problem: for
example, determining whether a given image contains a human face, or
whether an MRI image shows a malignant tumor. We may first try to
train a simple model such as a small neural network. If that fails, we
may move on to other, potentially more complex, methods of
classification, such as support vector machines with different kernels
or techniques that first apply certain transformations to the data.
Conventional statistical learning theory attempts to bound the number
of samples needed to learn to a specified level of accuracy for each
of the above models (e.g. neural networks, support vector machines).
Specifically, it is enough to bound the VC-dimension of the learning
model to determine the number of samples to use [VC71, BEHW89].
However, if we allow ourselves to change the model, then the
VC-dimension of the overall learning algorithm is not finite, and much
of statistical learning theory does not directly apply.



*I thank Erik Winfree and Matthew Cook for discussions and 
invaluable support. 



Accepting that much of the time the complexity of the model cannot be
a priori bounded, Structural Risk Minimization [Vap98] explicitly
considers a hierarchy of increasingly complex models. An alternative
approach, and one we follow in this paper, is simply to consider a
single learning model that includes all possible classification
methods.

We consider the unrestricted learning model consisting of all
computable classifiers. Since the VC-dimension is clearly infinite,
there are no uniform bounds (independent of the distribution and the
target concept) on the number of samples needed to learn accurately
[BEHW89]. Yet we still want to guarantee a desired level of accuracy.
Rather than deciding on the number of samples a priori, it is natural
to allow the learning algorithm to decide when it has seen
sufficiently many labeled samples based on the training samples seen
up to now and their labels. Since the above learning model includes
any practical classification scheme, we term it universal (PAC-)
learning.

We first show that there is a computable learning algorithm in our
universal setting. Then, in order to obtain bounds on the number of
training samples that would be needed, we consider measuring sample
complexity of the learning algorithm as a function of the unknown
correct labeling function (i.e. target concept). Although the correct
labeling is unknown, this sample complexity measure could be used to
compare learning algorithms speculatively: "if the target labeling
were such and such, learning algorithm A requires fewer samples than
learning algorithm B". By asking what is the largest sample size
needed assuming the target labeling function is in a certain class, we
could compare the sample complexity of the universal learner to a
learner over the restricted class (e.g. with finite VC-dimension).

However, we prove that it is impossible to bound the sample complexity
of any computable universal learning algorithm, even as a function of
the target concept. Depending on the distribution, any such bound will
be exceeded with arbitrarily high probability. The impossibility of a
distribution-independent bound is entirely due to the computability
requirement. Indeed we show there is an uncomputable learning
procedure for which we bound the number of samples queried as a
function of the unknown target concept, independently of the
distribution.

Our results imply that computable learning algorithms 
in the universal setting must "waste samples" in the sense 
of requiring more samples than is necessary for statistical 
reasons alone. 



2 Relation to Previous Work 

There is comparatively little work in statistical learning theory on
learning arbitrary computable classifiers compared to the volume of
research on learning in more restricted settings. Computational
learning theory (aka PAC-learning) requires learning algorithms to be
efficient in the sense of running in time polynomial in certain
parameters [Val84, KV94]. That work generally restricts learning to
very limited concept/hypothesis spaces such as perceptrons, DNF
expressions, limited-weight neural networks, etc. The purely
statistical learning theory paradigm ignores issues of computability
[VC71, Vap98]. Work on learning arbitrary computable functions is
mostly in the "learning in the limit" paradigm [Gol67, Ang88], in
which the goal of learning is to eventually converge to the perfectly
correct hypothesis as opposed to approximating it with an
approximately correct hypothesis.

The idea of allowing the learner to ask for a varying number of
training samples based on the ones previously seen was studied before
in statistical learning theory [LMR88, BI94]. Linial et al. [LMR88]
called this model "dynamic sampling" and showed that dynamic sampling
allows learning with a hypothesis space of infinite VC-dimension if
all hypotheses can be enumerated. This is essentially Theorem 4 of our
paper. However, the hypothesis space of all computable functions
cannot be enumerated by any algorithm, and thus these results do not
directly imply the existence of a learning algorithm in our setting.

Our proof technique for establishing positive results (Theorem 2) is
parallel evaluation of all hypotheses, and is based on Levin's
universal search [Lev73]. In learning theory, Levin's universal search
was previously used by Goldreich and Ron [GR97] to evaluate all
learning algorithms in parallel and obtain an algorithm with
asymptotically optimal computation time.

The main negative result of this paper is showing the absence of
distribution-independent bounds on sample complexity for computable
universal learning algorithms (Theorem 5). Recently Ryabko [Rya05]
considered learning arbitrary computable classifiers, albeit in a
setting where the number of samples for the learning algorithm is
externally chosen. He demonstrated a computational difficulty in
determining the number of samples needed: it grows faster than any
computable function of the length of the target concept. In contrast,
we prove that distribution-independent bounds do not exist altogether
for computable learning algorithms in our setting.

3 Definitions 

The sample space X is the universe of possible points over which
learning occurs. Here we will largely suppose the sample space X is
the set of all finite binary strings $\{0,1\}^*$. A concept space C
and hypothesis space H are sets of boolean-valued functions over X,
which are said to label points $x \in X$ as 0/1. The concept space C
is the set of all possible labeling functions that our learning
algorithm may be asked to learn. In each learning scenario, there is
some unknown target concept $c \in C$ that represents the desired way
of labeling points. There is also an unknown sample distribution D
over X. The learning algorithm chooses a hypothesis $h \in H$ based on
iid samples drawn from D and labeled according to the target concept
c. Since we cannot hope to distinguish between a hypothesis that is
always correct and one that is correct most of the time, we adopt the
"probably approximately correct" [Val84] goal of producing with high
probability ($1 - \delta$) a hypothesis h such that the probability
over $x \sim D$ that $h(x) \neq c(x)$ is small ($\varepsilon$).

Here we will mostly consider the concept space C to be the set of all
total recursive functions $X \to \{0,1\}$. We say that this is a
universal learning setting because C includes any practical
classification scheme. We will mostly consider the hypothesis space to
be the set of all partial recursive functions $X \to \{0,1,\perp\}$,
where $\perp$ indicates failure to halt. From PAC learning it is known
that it sometimes helps to use different concept and hypothesis
classes if one desires the learning algorithm to be efficient [PV88].
In a related way, allowing our algorithm to output a partial recursive
function that may not halt on all inputs seems to permit learning
(e.g. Theorem 2). Abusing notation, $c \in C$ or $h \in H$ will refer
to either the function or to a representation of that function as a
program. Similarly C and H will refer to the sets of functions or to
the sets of representations of the corresponding functions. We assume
all programs are written in some fixed alphabet and are interpreted by
some fixed universal Turing machine. If h is a partial recursive
function and $h(x) = \perp$ then by convention $h(x) \neq h'(x)$ for
any partial recursive function $h'$ (even if $h'(x) = \perp$ also).

We can now define what we mean by a learning algorithm:

Definition 1 Algorithm A is a learning algorithm over sample space X,
concept space C, and hypothesis space H if:

• (syntactic requirements) A takes two inputs $\delta \in (0,1)$ and
$\varepsilon \in (0,1/2)$, queries an oracle for pairs in
$X \times \{0,1\}$, and if A halts it outputs a hypothesis $h \in H$.

• (semantic requirements) For any $\delta, \varepsilon$, for any
concept $c \in C$, and any distribution D over X, if the oracle
returns pairs $(x, c(x))$ for x drawn iid from D, then A always halts,
and with probability at least $1 - \delta$ outputs a hypothesis h such
that $\Pr_{x \sim D}[h(x) \neq c(x)] < \varepsilon$.
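
The error criterion of Definition 1, together with the convention that
a non-halting computation disagrees with everything, can be made
concrete when D has finite support. The following Python sketch is
only an illustration under stated assumptions: non-halting is
represented by the value `None`, D is given as a dictionary of exact
probabilities, and the names `BOTTOM`, `disagree`, and `error` are
hypothetical helpers introduced here.

```python
from fractions import Fraction

BOTTOM = None  # assumed stand-in for "does not halt"; a real program cannot detect this


def disagree(a, b):
    """Disagreement under the convention that BOTTOM differs from everything,
    including another BOTTOM."""
    return a is BOTTOM or b is BOTTOM or a != b


def error(h, c, dist):
    """Pr_{x ~ D}[h(x) != c(x)] for a finite-support distribution.

    dist maps each point x to its probability under D.
    """
    return sum((p for x, p in dist.items() if disagree(h(x), c(x))), Fraction(0))


# Hypothetical example: D is uniform over four strings, c is the parity of
# the string length, and h agrees with c except that it "diverges" on "000".
dist = {s: Fraction(1, 4) for s in ["", "0", "00", "000"]}
c = lambda x: len(x) % 2
h = lambda x: BOTTOM if x == "000" else len(x) % 2

print(error(h, c, dist))  # 1/4
```

In this toy case h would meet the accuracy requirement only for
$\varepsilon > 1/4$.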

The always halting requirement seems a nice property of the learning
algorithm and indeed the learning algorithm we develop (Theorem 2)
will halt for any concept and sequence of samples. However, relaxing
this requirement to allow a non-zero probability that the learning
algorithm queries the oracle for infinitely many samples does not
change our negative results (Theorem 5), as long as a finite number of
oracle calls implies halting.

The fundamental notion in statistical learning theory is that of
sample complexity. Since the VC-dimension of our hypothesis space is
infinite, there is no uniform bound $m(\delta, \varepsilon)$ on the
number of samples needed to learn to the $\delta, \varepsilon$ level
of accuracy. We will consider the question of whether for a given
learning algorithm there is a distribution-independent bound
$m(c, \delta, \varepsilon)$ on the number of samples queried from the
oracle, where $c \in C$ is the target concept. In other words the
bound is allowed to depend on the target concept c but not on the
sample distribution D. Such a bound may be satisfied with certainty,
or satisfied with high probability over the learning samples.

4 Results 

We first show that there is a computable learning algorithm 
in our setting. 

Theorem 2 There is a learning algorithm over sample 
space X of all finite binary strings, hypothesis space H of 
all partial recursive functions, and concept space C of all 
total recursive functions. 

In order to prove this theorem we need the following 
lemma. Results equivalent to this lemma can be found 
in [LMR88].

Lemma 3 Let X be any sample space and D be any distribution over X.
Fix any function $c : X \to \{0,1\}$. Suppose hypothesis space H is
countable, and let $h_1, h_2, \ldots$ be some ordering of H. For any
$\delta, \varepsilon$, let
$m(i) = \lceil (2 \ln i + \ln(1/\delta) + \ln(\pi^2/6)) / \varepsilon \rceil$.
Suppose $x_1, x_2, \ldots$ is an infinite sequence of iid samples
drawn from D. Then the probability that there exists $h_i \in H$ such
that $\Pr_{x \sim D}[h_i(x) \neq c(x)] > \varepsilon$, but $h_i$
agrees with c on $x_1, x_2, \ldots, x_{m(i)}$, is less than $\delta$.

Proof: The probability that a particular $h_i$ with error probability
$\Pr_{x \sim D}[h_i(x) \neq c(x)] > \varepsilon$ gets $m(i)$ i.i.d.
instances drawn from D correct is less than
$(1 - \varepsilon)^{m(i)} < e^{-m(i)\varepsilon} \le (6/\pi^2)(\delta/i^2)$.
By the union bound, the probability that any $h_i$ with error
probability greater than $\varepsilon$ gets $m(i)$ instances correct
is less than $\sum_{i=1}^{\infty} (6/\pi^2)(\delta/i^2) = \delta$. ■
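
The sample counts of Lemma 3 and the way the failure budget $\delta$
is split across hypotheses can be checked numerically. The Python
sketch below is a sanity check rather than a proof; the function names
`m` and `per_hypothesis_bound`, the chosen values of $\delta$ and
$\varepsilon$, and the small floating-point tolerance are assumptions
of this example.

```python
import math


def m(i, delta, eps):
    """Number of samples on which hypothesis h_i is tested (Lemma 3)."""
    return math.ceil(
        (2 * math.log(i) + math.log(1 / delta) + math.log(math.pi**2 / 6)) / eps
    )


def per_hypothesis_bound(i, delta):
    """The share (6 / pi^2) * (delta / i^2) of the failure budget given to h_i."""
    return (6 / math.pi**2) * delta / i**2


delta, eps = 0.05, 0.1

# exp(-m(i) * eps) should stay within the budget of h_i; the factor
# (1 + 1e-9) only absorbs floating-point rounding.
within_budget = all(
    math.exp(-m(i, delta, eps) * eps) <= per_hypothesis_bound(i, delta) * (1 + 1e-9)
    for i in range(1, 1001)
)

# Partial sums of the per-hypothesis budgets stay below delta.
partial = sum(per_hypothesis_bound(i, delta) for i in range(1, 100001))

print(m(1, delta, eps), m(10, delta, eps), m(1000, delta, eps))  # 35 81 174
print(within_budget, partial < delta)  # True True
```

Note that $m(i)$ grows only logarithmically in i: moving from the 10th
to the 1000th hypothesis adds $2\ln(100)/\varepsilon$ samples, up to
rounding.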

Proof of Theorem 2: Let $h_1, h_2, \ldots$ be a recursive enumeration
of H (for example in lexicographic order). For the given
$\delta, \varepsilon$, let $m(i)$ be defined as in Lemma 3. The
learning algorithm computes infinitely many threads $1, 2, \ldots$
running in parallel. This can be done by a standard dovetailing
technique. (For example use the following schedule: for $k = 1$ to
infinity, for $i = 1$ to $k$, perform step $k - i + 1$ of thread $i$.)
Thread $i$ sequentially checks whether $h_i(x_1) = c(x_1)$,
$h_i(x_2) = c(x_2)$, $\ldots$, $h_i(x_{m(i)}) = c(x_{m(i)})$, exiting
if a check fails. If all $m(i)$ checks pass, thread $i$ terminates and
outputs $h_i$. The learning algorithm queries the oracle as necessary
for new learning samples and their labeling. The overall algorithm
terminates as soon as some thread outputs an $h_i$, and outputs this
hypothesis. By Lemma 3, with probability at least $1 - \delta$, this
$h_i$ has error probability less than $\varepsilon$. Further, since
$C \subset H$, the learning algorithm will always terminate. ■
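
The dovetailing schedule in the proof of Theorem 2 can be sketched in
Python. This is a minimal illustration under several assumptions that
are not in the proof itself: each hypothesis is modeled as a generator
function in which every `yield` counts as one computation step and the
`return` value is the label, a finite list stands in for the recursive
enumeration of H, and the names `thread`, `learn`, `loops_forever`,
`always_zero`, and `slow_parity` are hypothetical.

```python
import math
from itertools import count, cycle


def m(i, delta, eps):
    """Sample count for hypothesis h_i, as in Lemma 3."""
    return math.ceil(
        (2 * math.log(i) + math.log(1 / delta) + math.log(math.pi**2 / 6)) / eps
    )


def thread(h, n_checks, get_sample):
    """Test h on samples 0..n_checks-1, yielding once per step of h.

    Returns True if every check passes and False as soon as one fails.
    """
    for j in range(n_checks):
        x, label = get_sample(j)
        run = h(x)  # generator simulating the computation of h on x
        while True:
            try:
                next(run)
            except StopIteration as done:
                if done.value != label:
                    return False
                break
            yield  # one simulated step finished; hand control back


def learn(hypotheses, oracle, delta, eps):
    """Dovetail over the enumeration; return (index, hypothesis, samples used)."""
    pending = iter(hypotheses)
    samples = []

    def get_sample(j):
        # Query the oracle only when a thread needs a sample not yet drawn.
        while len(samples) <= j:
            samples.append(oracle())
        return samples[j]

    live = {}  # index i -> (h_i, its thread)
    for k in count(1):
        h = next(pending, None)
        if h is not None:
            live[k] = (h, thread(h, m(k, delta, eps), get_sample))
        elif not live:
            raise RuntimeError("finite enumeration exhausted without success")
        for i in sorted(live):  # round k: thread i performs its step k - i + 1
            h_i, t = live[i]
            try:
                next(t)
            except StopIteration as done:
                if done.value:
                    return i, h_i, len(samples)
                del live[i]


def loops_forever(x):
    while True:
        yield


def always_zero(x):
    return 0
    yield  # unreachable; only makes this function a generator


def slow_parity(x):
    for _ in range(5):  # pretend the computation takes five steps
        yield
    return len(x) % 2


points = cycle(["", "0", "00", "000"])  # deterministic stand-in for sampling from D


def oracle():
    x = next(points)
    return x, len(x) % 2  # target concept: parity of the string length


index, h, used = learn([loops_forever, always_zero, slow_parity], oracle, 0.1, 0.1)
print(index, used == m(3, 0.1, 0.1))  # 3 True
```

The first hypothesis never halts, yet the learner is not blocked by
it: the second thread fails on its second sample, and the third thread
eventually passes all of its checks.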

Note that it seems necessary to expand the hypothesis 
space to include all partial recursive functions because the 
concept space of total recursive functions does not have a 
recursive enumeration (it is uncomputable whether a given 
program is total recursive or not). 

We will see in Theorem 5 that there is no bound
$m(c, \delta, \varepsilon)$ on the number of samples queried by any
computable learning algorithm in our setting. Let us obtain some
intuition for why that is true for the above learning algorithm. Then
we will contrast this to the case of an uncomputable learning
algorithm.



In essence, we can make the above learning algorithm query for more
samples than is necessary for statistical reasons alone. Intuitively,
suppose that an $h_{i^*}$ coming early in the ordering is always
correct but takes a very long time to compute. The learning algorithm
cannot wait for this $h_{i^*}$ to finish, because it does not know
that any particular $h_i$ will ever halt. At some point it has to
start testing $h_i$'s that come later in the ordering and that have
larger $m(i)$'s. Testing these requires more learning samples than
$m(i^*)$.

If we can know which $h_i$'s are safe to skip over since they don't
halt, and for which $h_i$'s we should wait, then the above problem is
solved. Indeed, the following theorem shows that there is no
statistical reason why a distribution-independent bound
$m(c, \delta, \varepsilon)$ is impossible. The theorem presents a
well-defined method of learning (albeit an uncomputable one) for which
there exists such a bound, and this bound is satisfied with certainty.
Below, the halting oracle gives 0/1 answers to questions of the form
$(h, x)$ where $h \in H$, $x \in X$, such that a 1 answer indicates
that $h(x)$ halts and a 0 answer indicates it does not; the answers
are clearly uncomputable.

Theorem 4 If a learning algorithm is allowed to query the halting
oracle, then there is a learning algorithm over sample space X of all
finite binary strings, hypothesis space H of all partial recursive
functions, and concept space C of all total recursive functions, and a
function $m : C \times (0,1) \times (0,1/2) \to \mathbb{N}$, such that
for any approximation parameters $\delta, \varepsilon$, any target
concept $c \in C$, and any distribution D over X, the learning
algorithm uses at most $m(c, \delta, \varepsilon)$ training samples.

Proof: Rather than dovetailing as is done for the computable learning
algorithm (Theorem 2), we can sequentially test every $h_i$ on samples
$x_1, \ldots, x_{m(i)}$ because we can determine whether $h_i$ halts
on a given input. Since $c = h_{i^*}$ for some $h_{i^*} \in H$, the
hypothesis $h_i$ we output will always satisfy $i \le i^*$, and
therefore we will require at most
$m(i^*) = \lceil (2 \ln(i^*) + \ln(1/\delta) + \ln(\pi^2/6)) / \varepsilon \rceil$
samples. ■
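
The sequential procedure in the proof of Theorem 4 can be sketched as
follows. Since a halting oracle is uncomputable, the Python code below
takes it as an injected predicate `halts`, which in the toy usage is a
hand-written stand-in; this, the finite list of hypotheses, and all
function names are assumptions of the example, not a realizable
implementation.

```python
import math
from itertools import cycle


def m(i, delta, eps):
    """Sample count for hypothesis h_i, as in Lemma 3."""
    return math.ceil(
        (2 * math.log(i) + math.log(1 / delta) + math.log(math.pi**2 / 6)) / eps
    )


def learn_with_halting_oracle(hypotheses, halts, oracle, delta, eps):
    """Test h_1, h_2, ... one after another.

    Returns (index, hypothesis, number of samples drawn).
    """
    samples = []
    for i, h in enumerate(hypotheses, start=1):
        passed = True
        for j in range(m(i, delta, eps)):
            if j == len(samples):
                samples.append(oracle())  # draw a new sample only when needed
            x, label = samples[j]
            # Ask the oracle first, so h is never run where it diverges.
            if not halts(h, x) or h(x) != label:
                passed = False
                break
        if passed:
            return i, h, len(samples)
    raise RuntimeError("finite enumeration exhausted without success")


def diverges(x):
    raise AssertionError("never called: the oracle reports divergence")


def always_zero(x):
    return 0


def parity(x):
    return len(x) % 2


halts = lambda h, x: h is not diverges  # stand-in for the uncomputable oracle
points = cycle(["", "0", "00", "000"])  # deterministic stand-in for sampling from D


def oracle():
    x = next(points)
    return x, parity(x)  # target concept: parity, so i* = 3 in this list


index, h, used = learn_with_halting_oracle(
    [diverges, always_zero, parity], halts, oracle, 0.1, 0.1
)
print(index, used <= m(3, 0.1, 0.1))  # 3 True
```

Because no hypothesis after the target is ever examined, the number of
samples drawn never exceeds $m(i^*)$, whatever the distribution.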

We now show that for any computable learning algorithm, and any
possible sample bound $m(c, \delta, \varepsilon)$, there is a target
concept c and a sample distribution such that this sample bound is
violated with high probability. The probability of violation can be
made arbitrarily close to $1 - 2(\delta + (1 - \delta)\varepsilon)$
(which approaches 1 as $\delta, \varepsilon \to 0$). In fact this
theorem is stronger: it shows that given a learning algorithm, without
varying the target concept, but just by varying the distribution, it
is possible to make the algorithm ask for arbitrarily many learning
samples with high probability.

Theorem 5 For any learning algorithm over sample space X of all finite
binary strings, hypothesis space H of all partial recursive functions,
and concept space C of all total recursive functions, there is a
target concept $c \in C$, such that for any approximation parameters
$\delta, \varepsilon$, for any
$p < 1 - 2(\delta + (1 - \delta)\varepsilon)$, and for any sample
bound $m \in \mathbb{N}$ there is a distribution D over X, such that
the learning algorithm uses more than m training samples with
probability at least p.

The key difference between a computable and an uncomputable learning
algorithm is that a concept can simulate a computable one. By
simulating the learning algorithm, a concept can choose to behave in a
way that is bad for the learning algorithm's sample complexity.

To prove the above theorem, we will first need the following lemma.
The lemma essentially shows a situation such that any learning
algorithm according to our definition must query for more than m
learning samples with high probability when the target concept is
chosen adversarially. The lemma is true even without requiring the
learning algorithm to be computable. Note that the lemma does not
directly imply the theorem above, even in its weaker form, because in
order to increase the number of learning samples that are likely
queried by the learning algorithm, we have to change the target
concept. Since $m(c, \delta, \varepsilon)$ is a function of c, there
is no guarantee that the bound doesn't become larger as well.

Lemma 6 Let X be a set of d points, and let C be the set of all
labelings of X. Let D be a uniform distribution over X. Suppose A is a
learning algorithm over sample space X, concept and hypothesis space
C. For any accuracy parameters $\delta, \varepsilon$ and any $m < d$,
there is a concept $c \in C$ such that when the oracle draws from D
labeled according to c the probability that A samples more than m
points is at least
$1 - \frac{2d(\delta + (1 - \delta)\varepsilon)}{d - m}$.

Proof: We use the probabilistic method to find a particularly bad
concept $c^*$. Suppose we do not start with a fixed target concept c,
but draw it uniformly from C. In other words, c is determined by
values $\{c(x)\}_{x \in X}$ drawn uniformly from $\{0,1\}$. Given some
$x_1, \ldots, x_m$, $c(x_1), \ldots, c(x_m)$, and
$x \notin \{x_1, \ldots, x_m\}$, the value of $c(x)$ is a fair coin
flip. Thus if on $x_1, \ldots, x_m$ labeled by
$c(x_1), \ldots, c(x_m)$, A outputs a hypothesis without asking for
more samples, then the hypothesis is incorrect on x with probability
1/2. If we now let x vary, the probability that the hypothesis is
incorrect on x is at least $(1/2)(d - m)/d$ since there are at least
$d - m$ points not in $x_1, \ldots, x_m$. Now suppose for any c the
probability that A samples more than m points is at most p. Then the
unconditional probability that the hypothesis output by A is incorrect
on a random sample point is at least $(1 - p)(1/2)(d - m)/d$. This
implies that there is a concept $c^* \in C$ such that the probability
that the hypothesis output by A is incorrect on a random sample point
is at least $(1 - p)(1/2)(d - m)/d$.

Since A is a learning algorithm, when we use $c^*$ to label the
training points, and use accuracy parameters $\delta, \varepsilon$,
the probability that the hypothesis produced by A has error
probability greater than $\varepsilon$ is at most $\delta$. If we make
the worst case assumption that whenever the error probability of the
hypothesis is larger than $\varepsilon$ it is exactly 1, and otherwise
the error probability is exactly $\varepsilon$, then the probability
that the hypothesis output by A is incorrect on a random sample point
is at most $\delta \cdot 1 + (1 - \delta)\varepsilon$. Thus
$(1 - p)(1/2)(d - m)/d \le \delta + (1 - \delta)\varepsilon$, implying
that $p \ge 1 - \frac{2d(\delta + (1 - \delta)\varepsilon)}{d - m}$. ■
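
To see the counting argument of Lemma 6 at work, consider a
hypothetical learner that always stops after exactly m samples,
memorizes the labels it saw, and guesses 0 elsewhere. The Python
sketch below computes its exact average error under a uniformly random
concept and compares it with the quantities in the proof; the learner
and the names `memorizer_avg_error`, `proof_lower_bound`, and
`pac_budget` are assumptions introduced for this illustration.

```python
from fractions import Fraction


def memorizer_avg_error(d, m):
    """Average error of the memorizing learner over a uniformly random labeling.

    A random point is unseen with probability (1 - 1/d)^m, and on an unseen
    point the guess is wrong with probability 1/2.
    """
    return Fraction(1, 2) * Fraction(d - 1, d) ** m


def proof_lower_bound(d, m, p=Fraction(0)):
    """The bound (1 - p)(1/2)(d - m)/d from the proof of Lemma 6."""
    return (1 - p) * Fraction(1, 2) * Fraction(d - m, d)


def pac_budget(delta, eps):
    """Largest average error delta * 1 + (1 - delta) * eps allowed to a learner."""
    return delta + (1 - delta) * eps


d, m = 10, 3
delta = eps = Fraction(1, 20)

print(memorizer_avg_error(d, m), proof_lower_bound(d, m))  # 729/2000 7/20
print(memorizer_avg_error(d, m) > pac_budget(delta, eps))  # True
```

This learner never asks for more than m samples (p = 0), so its
average error exceeds the allowed budget and it cannot satisfy
Definition 1 for these parameters; a valid learner must ask for more
samples with substantial probability.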

Now in order to prove Theorem 5 we essentially show that there is some
fixed concept $c^*$ that behaves as the bad c's in arbitrary instances
of Lemma 6.

Proof of Theorem 5: Consider the following program
$P : \{0,1\}^* \to \{0,1\}$. First it interprets the given string
$x \in \{0,1\}^*$ as a tuple $(\delta, \varepsilon, m, d, i)$ for
$\delta \in (0,1)$, $\varepsilon \in (0,1/2)$ and
$m, d, i \in \mathbb{N}$ using some fixed one-to-one encoding of such
tuples as binary strings. If x cannot be decoded appropriately, or if
$i > d$, then P returns 0. Otherwise, for these
$\delta, \varepsilon, m, d$, let $X \subset \{0,1\}^*$ be the set of d
strings which are interpreted as
$\{(\delta, \varepsilon, m, d, 1), \ldots, (\delta, \varepsilon, m, d, d)\}$,
and let D be a uniform distribution over X and 0 elsewhere. Let C be
the set of all possible labelings of X. For each labeling $c \in C$,
program P computes the probability $p_c$ that A, given accuracy
parameters $\delta, \varepsilon$, queries for more than m sample
points if points are drawn from D labeled according to c. For each c,
this requires simulating A for at most $d^m$ different sequences of
sample points. Let $c^* = \arg\max_{c \in C} \{p_c\}$, breaking ties
in some fixed way. Finally P outputs $c^*(x)$.
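
The search that P performs for one fixed choice of d and m can be
sketched in Python for very small values. The sketch makes simplifying
assumptions that are not in the proof: the decoding of tuples is
omitted, points are represented by indices $0, \ldots, d-1$, and the
simulated learner is abstracted as a predicate `asks_more` that
receives the first m labeled samples and reports whether the learner
would query for another one. The toy learner used here is hypothetical
and is not claimed to satisfy Definition 1.

```python
from fractions import Fraction
from itertools import product


def prob_asks_more(asks_more, labeling, m):
    """p_c: probability that the learner wants an (m+1)-th sample when points
    are drawn uniformly from d points labeled by `labeling`."""
    d = len(labeling)
    sequences = list(product(range(d), repeat=m))  # all d**m, equally likely
    hits = sum(1 for seq in sequences if asks_more([(i, labeling[i]) for i in seq]))
    return Fraction(hits, len(sequences))


def worst_labeling(asks_more, d, m):
    """argmax over all 2**d labelings; max() keeps the first maximizer, which
    fixes the tie-breaking rule."""
    return max(
        product((0, 1), repeat=d),
        key=lambda labeling: prob_asks_more(asks_more, labeling, m),
    )


def P(asks_more, d, m, i):
    """Label of the i-th point (1-based) under the worst labeling."""
    return worst_labeling(asks_more, d, m)[i - 1]


# Hypothetical toy learner: it stops after m samples only if all labels agree.
def toy_asks_more(labeled_samples):
    return len({label for _, label in labeled_samples}) > 1


print(worst_labeling(toy_asks_more, 3, 2))          # (0, 0, 1)
print(prob_asks_more(toy_asks_more, (0, 0, 1), 2))  # 4/9
print([P(toy_asks_more, 3, 2, i) for i in (1, 2, 3)])  # [0, 0, 1]
```

The exhaustive search costs on the order of $2^d d^m$ simulations,
which is why the sketch is only practical for tiny d and m; the proof
needs only that the search terminates.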

Observe that P is total recursive since A spends a finite time on any
finite sequence of sample points. (This is a weaker condition than the
always halting requirement of our definition of a learning algorithm.)
Thus P is some $c^* \in C$. Further, for any
$\delta, \varepsilon, m, d$, on all points
$(\delta, \varepsilon, m, d, i)$ for $i \le d$, P finds the same
maximizing labeling, and thus on these points $c^*$ acts like this
maximizer. By Lemma 6, if $m < d$ then this maximizer has the property
that $p_{c^*} \ge 1 - \frac{2d(\delta + (1 - \delta)\varepsilon)}{d - m}$.
Therefore, if A is given accuracy parameters $\delta, \varepsilon$,
the target concept is $c^*$, and the distribution D is uniform over
$\{(\delta, \varepsilon, m, d, 1), \ldots, (\delta, \varepsilon, m, d, d)\}$
for some $d \in \mathbb{N}$ such that $m < d$, then the probability
that A requests more than m samples is at least
$1 - \frac{2d(\delta + (1 - \delta)\varepsilon)}{d - m}$. Since we can
choose D such that d is large enough, we obtain the desired result. ■
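
The last step of the proof, choosing d large enough, can be made
explicit. Writing $q = \delta + (1 - \delta)\varepsilon$, the
inequality $1 - 2dq/(d - m) \ge p$ rearranges to
$d \ge (1 - p)m/(1 - p - 2q)$, which is solvable exactly when
$p < 1 - 2q$. The Python sketch below evaluates this; the function
names are hypothetical, and the result is the smallest suitable d only
up to floating-point rounding.

```python
import math


def lemma6_bound(d, m, delta, eps):
    """Lower bound 1 - 2d(delta + (1 - delta) eps)/(d - m) of Lemma 6 (needs m < d)."""
    return 1 - 2 * d * (delta + (1 - delta) * eps) / (d - m)


def support_size_for(p, m, delta, eps):
    """A support size d > m whose Lemma 6 bound is at least p.

    Uses d >= (1 - p) m / (1 - p - 2q) with q = delta + (1 - delta) eps.
    """
    q = delta + (1 - delta) * eps
    if p >= 1 - 2 * q:
        raise ValueError("p must be below 1 - 2(delta + (1 - delta) eps)")
    return max(m + 1, math.ceil((1 - p) * m / (1 - p - 2 * q)))


d = support_size_for(p=0.5, m=100, delta=0.05, eps=0.05)
print(d)  # 164
print(lemma6_bound(d, 100, 0.05, 0.05) >= 0.5)      # True
print(lemma6_bound(d - 1, 100, 0.05, 0.05) >= 0.5)  # False
```

As p approaches $1 - 2q$ the required d grows without bound, which
matches the strict inequality on p in the statement of Theorem 5.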



5 Conclusion 

We have shown that learning arbitrary computable classifiers 
is possible in the statistical learning paradigm. However for 
any computable learning algorithm, the number of samples 
required to learn to a desired level of accuracy may become 
arbitrarily large depending on the sample distribution. This 
is in contrast to uncomputable learning methods in the same 
universal setting whose sample complexity can be bounded 
independently of the distribution. 

Our results mean that there is a big price in terms of sample
complexity to be paid for the combination of universality and
computability of the learner. Specifically, by tweaking the
distribution we can make a computable universal learner arbitrarily
worse than a restricted learning algorithm on a finite VC-dimensional
hypothesis space, or even an uncomputable universal learner.

While we have presented a single computable learning algorithm in our
universal setting, one would like to develop a measure that would
allow different learning algorithms to be compared to each other in
terms of sample complexity. We have seen that sample complexity
$m(c, \delta, \varepsilon)$ is not such a measure; is there a viable
alternative?

Finally, we have ignored computation time in our analysis. As such,
our learning algorithm is not likely to have practical significance.
Integrating running time into the theory presented would be a critical
extension.